We want to understand which chemical properties influence the quality of red wines.
The data set can be downloaded here and metadata information is available here.
We will use the following R libraries: ggplot2, gridExtra, GGally, stats and memisc.
To read the dataset into a variable reds, I will use R read.csv function.
The variables names of the dataset and their structures are displayed as follow:
reds <- read.csv('wineQualityReds.csv')
names(reds)
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
str(reds)
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
This dataset has 1599 rows or observations in it with 13 variables. Each observation represents an evalution of a red wine quality.
I will first explore the dataset one variable at a time. Then I will explore two variables, and multiple variables. And finally I will predict the red wine quality based of some chemical properties.
I will use the qplot function to plot a histogram of each feature. We will also display some descriptive statistics about each feature using summary function. Then we will interpret the distribution of each feature.
We write a custom function called getOutlierExtremes to return weak and strong outliers extremes. The calculation is based on inter-quantile range (IQR).
If we subtract 1.5 x IQR from the first quartile, any data values that are less than this number are considered at least as weak outliers. Similarly if we add 1.5 x IQR to the third quartile, any data values that are greater than this number are considered at least as weak outliers. And applying the same rule based on 3 x IQR, we determine strong outliers.
getOutlierExtremes <- function(feature){
desc <- summary(feature)
#Calcuting the interquatile range
iqr <- IQR(feature)
#Weak outliers extremes
weakLow <- desc[2] - (1.5 * iqr)
weakHigh <- desc[5] + (1.5 * iqr)
#Strong outliers extremes
strongLow <- desc[2] - (3 * iqr)
strongHigh <- desc[5] + (3 * iqr)
values <- c(iqr, weakLow, strongLow, weakHigh, strongHigh)
names(values) <- c('IQR', 'Weak Low Ext.','Strong Low Ext.','Weak High Ext.','Strong High Ext.')
return (values)
}
summary(reds$fixed.acidity)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
getOutlierExtremes(reds$fixed.acidity)
## IQR Weak Low Ext. Strong Low Ext. Weak High Ext.
## 2.10 3.95 0.80 12.35
## Strong High Ext.
## 15.50
The feature fixed.acidity is not normally-distributed. But the feature is normally-distributed on logarithmic base.
The feature fixed.acidity has some strong outliers.
summary(reds$volatile.acidity)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
getOutlierExtremes(reds$volatile.acidity)
## IQR Weak Low Ext. Strong Low Ext. Weak High Ext.
## 0.250 0.015 -0.360 1.015
## Strong High Ext.
## 1.390
The feature volatile.acidity is normally-distributed with some strong outliers.
summary(reds$citric.acid)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
getOutlierExtremes(reds$citric.acid)
## IQR Weak Low Ext. Strong Low Ext. Weak High Ext.
## 0.330 -0.405 -0.900 0.915
## Strong High Ext.
## 1.410
The feature citric.acid has many 0 values, is not normally-distributed and has some weak outliers.
summary(reds$residual.sugar)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
getOutlierExtremes(reds$residual.sugar)
## IQR Weak Low Ext. Strong Low Ext. Weak High Ext.
## 0.70 0.85 -0.20 3.65
## Strong High Ext.
## 4.70
The feature residual.sugar has many strong outliers, has a long tail, and is not normally-distributed.
summary(reds$chlorides)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
getOutlierExtremes(reds$chlorides)
## IQR Weak Low Ext. Strong Low Ext. Weak High Ext.
## 0.02 0.04 0.01 0.12
## Strong High Ext.
## 0.15
The feature chlorides has some weak outliers on the left side, and strong outliers on the right side. It has a long tail, and is not normally-distributed.
summary(reds$free.sulfur.dioxide)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
getOutlierExtremes(reds$free.sulfur.dioxide)
## IQR Weak Low Ext. Strong Low Ext. Weak High Ext.
## 14 -14 -35 42
## Strong High Ext.
## 63
The feature free.sulfur.dioxide has strong outliers, and is not normally-distributed.
summary(reds$total.sulfur.dioxide)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
getOutlierExtremes(reds$total.sulfur.dioxide)
## IQR Weak Low Ext. Strong Low Ext. Weak High Ext.
## 40 -38 -98 122
## Strong High Ext.
## 182
The feature total.sulfur.dioxide has strong outliers, and is not normally-distributed.
summary(reds$density)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0040
getOutlierExtremes(reds$density)
## IQR Weak Low Ext. Strong Low Ext. Weak High Ext.
## 0.0022350 0.9922475 0.9888950 1.0011525
## Strong High Ext.
## 1.0045050
The feature density is normally-distributed. It has weak outliers on both sides.
summary(reds$pH)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
getOutlierExtremes(reds$pH)
## IQR Weak Low Ext. Strong Low Ext. Weak High Ext.
## 0.190 2.925 2.640 3.685
## Strong High Ext.
## 3.970
The feature pH is normally-distributed. It has weak outliers on the left side and strong outliers on the right side.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
## IQR Weak Low Ext. Strong Low Ext. Weak High Ext.
## 0.18 0.28 0.01 1.00
## Strong High Ext.
## 1.27
The feature sulphates is not normally-distributed and has strong outliers.
summary(reds$alcohol)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
getOutlierExtremes(reds$alcohol)
## IQR Weak Low Ext. Strong Low Ext. Weak High Ext.
## 1.6 7.1 4.7 13.5
## Strong High Ext.
## 15.9
The feature alcohol is not normally-distributed and has weak outliers.
summary(reds$quality)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
getOutlierExtremes(reds$quality)
## IQR Weak Low Ext. Strong Low Ext. Weak High Ext.
## 1.0 3.5 2.0 7.5
## Strong High Ext.
## 9.0
The feature quality is normally-distributed. It has weak outliers on both side.
I will first explore the target feature (quality) relation with each input feature. Then I will explore the relationships between input features.
Good wines seem to have higher fixed.acidity.
Good wines seem to have lower volatile.acidity.
Good red wines seem to have higher citric.acid.
Residual sugar did not seem to have a dramatic impact on the quality the red wines.
Good red wines seem to have lower chlorides.
Sulfur dioxides did not seem to have a dramatic impact on the quality of the red wines.
Good red wines seem to have lower density.
Good red wines seem to have lower pH.
Good red wines seem to have lower sulphates.
Good wines seem to have higher alcohol.
I will use ggpairs to visualize the correlations between features in the red wines dataset.
I assume existence of correlation between two features where the correlation coefficient is outside the intervale [-0.32, 0.32].
Some observations:
I will compute a new categorial variable based on quality called rating. For red wines quality rated less than 5, rating will have the value ‘bad’. For red wines quality rated 5 or 6, rating will have the value ‘average’. And rating will have value for redquality egal 7 and above. This feature will be used to color our plots.
The features volatile.acidity and alcohol are correlated to the target feature quality. I test the independence between the features volatile.acidity and alcohol using pearson correlation method. The density and pH features are normally distributed. But the density feature is strongly correlated to the alcohol feature. The pH feature is neither correlated to the alcohol, neither to the volatile.acidity feature. I thought of creating a categorial variable based on pH, but I abondonned the idea as all pH in the dataset are less than 7, specific to acidic solutions. The rest of the features where not normally distributed even after log10 or sqrt transformations and are not correlated to the quality feature.
Let test the independence of the feature alcohol and the feature volatile.acidity using correlation test with pearson method.
#Compute correlation between features alcohol and volatile.acidity for independence
cor.test(reds$volatile.acidity, reds$alcohol, method = 'pearson')
##
## Pearson's product-moment correlation
##
## data: reds$volatile.acidity and reds$alcohol
## t = -8.2546, df = 1597, p-value = 3.155e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2488416 -0.1548020
## sample estimates:
## cor
## -0.202288
With 95 percent confidence interval between -0.2488416 and -0.1548020 of correlation coefficient, the features volatile.acidity and alcohol are independent.
I display boxplots of alcohol/quality and volatile.acidity/quality color coded by quality rating.
From these boxplots, the quality of red wine is good when the acohol percentage is high. And the quality of red wine is good at low values of the volatile.acidity.
I display the points plot of volatile.acidity/alcohol features color coded with quality rating feature.
Again the red wines rated good are at low values of volatile.acidity and high values of alcohol.
bads <- subset(reds, reds$rating == 'bad')
cor.test(bads$quality, bads$volatile.acidity, method = 'pearson')
##
## Pearson's product-moment correlation
##
## data: bads$quality and bads$volatile.acidity
## t = -2.3049, df = 61, p-value = 0.02459
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.49602400 -0.03794003
## sample estimates:
## cor
## -0.2830444
cor.test(bads$quality, bads$alcohol, method = 'pearson')
##
## Pearson's product-moment correlation
##
## data: bads$quality and bads$alcohol
## t = 0.97924, df = 61, p-value = 0.3313
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.1272830 0.3610418
## sample estimates:
## cor
## 0.1244053
A correlation doesn’t exist between quality and volatile.acidity for bad red wines. And a correlation doesn’t exist between quality and alcohol for bad red wines.
averages <- subset(reds, reds$rating == 'average')
cor.test(averages$quality, averages$volatile.acidity, method = 'pearson')
##
## Pearson's product-moment correlation
##
## data: averages$quality and averages$volatile.acidity
## t = -8.8607, df = 1317, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2874883 -0.1855937
## sample estimates:
## cor
## -0.2371933
cor.test(averages$quality, averages$alcohol, method = 'pearson')
##
## Pearson's product-moment correlation
##
## data: averages$quality and averages$alcohol
## t = 14.69, df = 1317, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3278896 0.4206801
## sample estimates:
## cor
## 0.3752245
A correlation doesn’t exist between quality and volatile.acidity for average red wines. But a correlation exists between quality and alcohol for average red wines.
goods <- subset(reds, reds$rating == 'good')
cor.test(goods$quality, goods$volatile.acidity, method = 'pearson')
##
## Pearson's product-moment correlation
##
## data: goods$quality and goods$volatile.acidity
## t = 0.54322, df = 215, p-value = 0.5875
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.0966390 0.1693712
## sample estimates:
## cor
## 0.03702191
cor.test(goods$quality, goods$alcohol, method = 'pearson')
##
## Pearson's product-moment correlation
##
## data: goods$quality and goods$alcohol
## t = 2.592, df = 215, p-value = 0.0102
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.0418609 0.3002971
## sample estimates:
## cor
## 0.1740748
A correlation doesn’t exist between quality and volatile.acidity for good red wines. But a correlation doesn’t exist between quality and alcohol for average red wines.
goodOrBads <- subset(reds, reds$rating != 'average')
cor.test(goodOrBads$quality, goodOrBads$volatile.acidity, method = 'pearson')
##
## Pearson's product-moment correlation
##
## data: goodOrBads$quality and goodOrBads$volatile.acidity
## t = -12.876, df = 278, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.6797322 -0.5321147
## sample estimates:
## cor
## -0.6112117
cor.test(goodOrBads$quality, goodOrBads$alcohol, method = 'pearson')
##
## Pearson's product-moment correlation
##
## data: goodOrBads$quality and goodOrBads$alcohol
## t = 9.7474, df = 278, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4118364 0.5871768
## sample estimates:
## cor
## 0.5046932
A correlation exists between quality and volatile.acidity for a subset of the data for good or bad red wines. A correlation exists between quality and alcohol for a subset of the data for good or bad red wines.
I will use a linear model to predict the quality of red wines based on independent input features.
##
## Calls:
## m1: lm(formula = quality ~ alcohol, data = reds)
## m2: lm(formula = quality ~ alcohol + volatile.acidity, data = reds)
## m3: lm(formula = quality ~ alcohol + volatile.acidity + pH, data = reds)
##
## ===============================================
## m1 m2 m3
## -----------------------------------------------
## (Intercept) 1.875*** 3.095*** 4.269***
## (0.175) (0.184) (0.369)
## alcohol 0.361*** 0.314*** 0.330***
## (0.017) (0.016) (0.017)
## volatile.acidity -1.384*** -1.279***
## (0.095) (0.099)
## pH -0.422***
## (0.115)
## -----------------------------------------------
## R-squared 0.227 0.317 0.323
## adj. R-squared 0.226 0.316 0.321
## sigma 0.710 0.668 0.665
## F 468.267 370.379 253.328
## p 0.000 0.000 0.000
## Log-likelihood -1721.057 -1621.814 -1615.101
## Deviance 805.870 711.796 705.845
## AIC 3448.114 3251.628 3240.202
## BIC 3464.245 3273.136 3267.087
## N 1599 1599 1599
## ===============================================
With small R-Squared values, the linear model seems not adapted in predicting the quality of red wines for the entire dataset.
##
## Calls:
## m1: lm(formula = quality ~ alcohol, data = subset(reds, rating !=
## "average"))
## m2: lm(formula = quality ~ alcohol + volatile.acidity, data = subset(reds,
## rating != "average"))
## m3: lm(formula = quality ~ alcohol + volatile.acidity + pH, data = subset(reds,
## rating != "average"))
##
## ===============================================
## m1 m2 m3
## -----------------------------------------------
## (Intercept) -0.668 2.605*** 6.055***
## (0.724) (0.646) (1.237)
## alcohol 0.625*** 0.475*** 0.539***
## (0.064) (0.053) (0.056)
## volatile.acidity -3.319*** -2.834***
## (0.274) (0.308)
## pH -1.329**
## (0.409)
## -----------------------------------------------
## R-squared 0.255 0.513 0.531
## adj. R-squared 0.252 0.509 0.525
## sigma 1.201 0.973 0.957
## F 95.012 145.657 103.979
## p 0.000 0.000 0.000
## Log-likelihood -447.573 -388.120 -382.861
## Deviance 400.961 262.223 252.556
## AIC 901.146 784.239 775.722
## BIC 912.050 798.778 793.896
## N 280 280 280
## ===============================================
For a subset of the data reduced to good wines and bad wines only, we note improved values for R-Squared.
I will display histogram of wine quality to see how quality is distributed in our dataset.
The quality rating with highest number is 6. And we can see that most of wines in our dataset is rated between 5 and 7.
I will display a histogram of wine alcohol. Each bar is proportionally color coded using the quality rating.
The color coded histogram shows that the proportion of good wines is increasing when alcohol percentage is increasing.
I will display a scatter plot of good wines and bad wines with volatile acidity on the x-axis and alcohol on the y-axis.
The colored scatter plot shows that good wines are on high alcohol percentage and low volatile acidity density. It also shows at the same time that bad wines are on low alcohol percentage and high volatile acidity density.
Based on the EDA, alcohol percentage is one of the most important factor that impacts the quality of red wine. Another important factor impacting the quality of a red wine is volatile Acidity.
We have been able to better predict red wine quality for good wines and bad wines based on Alcohol and volatle acidity features. But for average quality rating, the prediction of the quality is performing poorly. Then we have not been able to determine the input features that impact average quality rating red wines.
In this case for example, where the quality is determined by tasting the wine by experts, there is room for subjectivity. I will recommend using a model like random forest to predict the quality.
Another issue was validating the model. Because we didn’t divide the dataset into model training data and testing data, there is a risk of over fitting the data regardless of the model.